This case study analyzes 2025 Cyclistic bike-share data to answer the question:
How do annual members and casual riders use Cyclistic bikes differently?
Using R and the tidyverse, I cleaned,
processed, and analyzed the dataset to explore patterns in ride
frequency, time of year, day of
week, time of day, ride
length, and bike type. The goal is to identify
meaningful differences in behavior between rider types and provide
actionable insights for marketing and operational strategies to convert
single-ride and day-pass (“casual”)
riders into annual “members”.
Install and load relevant packages.
# install.packages("tidyverse")
# install.packages("knitr")
# install.packages("kableExtra")
library(tidyverse)
library(knitr)
library(kableExtra)
Get a list of all CSV files in the data folder.
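# `pattern` is a regular expression, not a glob, so match a literal ".csv" at the end of the name
csv_file_paths <- list.files(path = "data/2025-divvy-tripdata", pattern = "\\.csv$", full.names = TRUE)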
Read all csv files and combine them into a single tibble data frame.
tripdata_2025_combined <- map_dfr(csv_file_paths, read_csv)
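As an alternative sketch, `readr::read_csv()` also accepts a vector of file paths directly and can record which file each row came from through its `id` argument. That can help trace a suspicious row back to its monthly file. The temporary files and the `source_file` column name below are made up for illustration and are not part of the Cyclistic data.

```r
library(readr)

# Two tiny stand-in CSV files in a temporary folder (hypothetical data)
tmp_dir <- file.path(tempdir(), "demo-tripdata")
dir.create(tmp_dir, showWarnings = FALSE)
write_csv(
  data.frame(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
  file.path(tmp_dir, "202501-demo.csv")
)
write_csv(
  data.frame(ride_id = "B1", member_casual = "member"),
  file.path(tmp_dir, "202502-demo.csv")
)

demo_paths <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# read_csv() row-binds the files; `id` adds a column holding each row's file path
demo_combined <- read_csv(demo_paths, id = "source_file", show_col_types = FALSE)
demo_combined
```

This only works when every file shares the same columns, which is the same assumption the `map_dfr()` approach relies on.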
Let’s take a look at the raw combined data in our data frame.
glimpse(tripdata_2025_combined)
## Rows: 5,552,994
## Columns: 13
## $ ride_id <chr> "7569BC890583FCD7", "013609308856B7FC", "EACACD3CE0…
## $ rideable_type <chr> "classic_bike", "electric_bike", "classic_bike", "c…
## $ started_at <dttm> 2025-01-21 17:23:54, 2025-01-11 15:44:06, 2025-01-…
## $ ended_at <dttm> 2025-01-21 17:37:52, 2025-01-11 15:49:11, 2025-01-…
## $ start_station_name <chr> "Wacker Dr & Washington St", "Halsted St & Wrightwo…
## $ start_station_id <chr> "KA1503000072", "TA1309000061", "13235", "13235", "…
## $ end_station_name <chr> "McClurg Ct & Ohio St", "Racine Ave & Belmont Ave",…
## $ end_station_id <chr> "TA1306000029", "TA1308000019", "13278", "13071", "…
## $ start_lat <dbl> 41.88314, 41.92915, 41.94823, 41.94823, 41.94823, 4…
## $ start_lng <dbl> -87.63724, -87.64915, -87.66407, -87.66407, -87.664…
## $ end_lat <dbl> 41.89259, 41.93974, 41.94553, 41.94374, 41.94374, 4…
## $ end_lng <dbl> -87.61729, -87.65887, -87.64644, -87.66402, -87.664…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
We can confirm here that started_at and ended_at are in date/time format.
We’ll start by creating a cleaned working dataset.
We want to preserve the original raw dataset and perform all
transformations on a cleaned copy.
tripdata_2025_cleaned <- tripdata_2025_combined
Let’s start by checking if any columns have missing data.
tibble(
column = names(tripdata_2025_cleaned),
missing_count = colSums(is.na(tripdata_2025_cleaned))
) %>%
mutate(missing_count = format(missing_count, big.mark = ",")) %>%
kable(
col.names = c("Column to Check", "Missing Data Count")
) %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "bordered"))
| Column to Check | Missing Data Count |
|---|---|
| ride_id | 0 |
| rideable_type | 0 |
| started_at | 0 |
| ended_at | 0 |
| start_station_name | 1,184,673 |
| start_station_id | 1,184,673 |
| end_station_name | 1,243,305 |
| end_station_id | 1,243,305 |
| start_lat | 0 |
| start_lng | 0 |
| end_lat | 5,535 |
| end_lng | 5,535 |
| member_casual | 0 |
Station names and IDs are missing for roughly 21% to 22% of rides, but since
most of our analyses don’t rely on them, we can leave those rows in place.
For a more detailed analysis, we might be able to impute some of these
missing values by matching latitude and longitude values.
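A minimal sketch of what that coordinate matching could look like, using made-up rows rather than the real data. The helper columns (`lat_key`, `lng_key`, `lookup_name`) and the choice to round to four decimals are assumptions for illustration only; the right precision would need to be checked against the actual station coordinates.

```r
library(dplyr)
library(tibble)

# Toy rides: the second row is missing its station name (made-up values)
rides <- tibble(
  start_station_name = c("Station A", NA, "Station B"),
  start_lat = c(41.88314, 41.88314, 41.92915),
  start_lng = c(-87.63724, -87.63724, -87.64915)
)

# Lookup of known stations keyed by rounded coordinates
station_lookup <- rides %>%
  filter(!is.na(start_station_name)) %>%
  transmute(
    lat_key = round(start_lat, 4),
    lng_key = round(start_lng, 4),
    lookup_name = start_station_name
  ) %>%
  distinct()

# Fill a missing name only when the rounded coordinates match a known station
rides_imputed <- rides %>%
  mutate(lat_key = round(start_lat, 4), lng_key = round(start_lng, 4)) %>%
  left_join(station_lookup, by = c("lat_key", "lng_key")) %>%
  mutate(start_station_name = coalesce(start_station_name, lookup_name)) %>%
  select(-lat_key, -lng_key, -lookup_name)

rides_imputed
```

If one coordinate key maps to more than one station name, the join would duplicate rows, so the lookup should be checked for uniqueness before using it on the full dataset.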
Out of the important columns for our analyses, it looks like only
end_lat and end_lng have missing data (5,535
each).
As a quick follow-up, we can see if any of our rows with missing end_lat or end_lng values have end_station_name or end_station_id values we could try to match up to impute the values.
tripdata_2025_cleaned %>%
filter(is.na(end_lat) | is.na(end_lng)) %>%
summarise(
missing_station_name = sum(is.na(end_station_name)),
missing_station_id = sum(is.na(end_station_id))
) %>%
mutate(
missing_station_name = format(missing_station_name, big.mark = ","),
missing_station_id = format(missing_station_id, big.mark = ",")
) %>%
kable(
col.names = c("Missing End Station Name", "Missing End Station ID")
) %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("bordered"))
| Missing End Station Name | Missing End Station ID |
|---|---|
| 5,535 | 5,535 |
It seems that all rows with missing end_lat or end_lng values are also missing the end station name and ID, so we don’t have a reliable way to impute the coordinates. With only 5,535 affected rows, about 0.1% of the dataset, the effect on our analyses should be negligible.
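To put the missing counts in proportion, here is a quick calculation using the figures reported above (the counts are copied from the missing-data table; the variable names are just for this sketch):

```r
total_rows <- 5552994  # rows in the raw combined data

# Missing counts copied from the missing-data table
missing_counts <- c(
  start_station_name = 1184673,
  end_station_name   = 1243305,
  end_lat            = 5535
)

# Percentage of all rows with a missing value in each column
missing_pct <- round(100 * missing_counts / total_rows, 2)
missing_pct
```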
Since our analysis depends on comparing rider types, we should verify that the member_casual column only contains either member or casual.
tripdata_2025_cleaned %>%
count(member_casual) %>%
mutate(n = format(n, big.mark = ",")) %>%
kable(
col.names = c("Rider Type", "Count")
) %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("bordered"))
| Rider Type | Count |
|---|---|
| casual | 1,999,497 |
| member | 3,553,497 |
This looks correct, as we only have two unique values in the
member_casual column.
We can also see that there are about 78% more member rides than
casual rides.
Since there are no missing or incorrect values in
member_casual, we know all trips can be categorized by rider
type and we don’t need to filter anything out yet.
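The 78% figure is a relative difference, which is easy to confuse with a share of the total. A short check using the counts from the table above:

```r
casual_rides <- 1999497
member_rides <- 3553497

# Relative difference: how many more member rides there are, as a percentage of casual rides
pct_more_member <- 100 * (member_rides - casual_rides) / casual_rides
round(pct_more_member, 1)

# A different view of the same counts: the member share of all rides
member_share <- 100 * member_rides / (member_rides + casual_rides)
round(member_share, 1)
```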
Now we’ll check if any ended_at date/times are the same as
the started_at date/times.
This could still be valid if the start and end stations are the same, for
example because the user canceled the trip or a technical issue occurred.
sum(tripdata_2025_cleaned$ended_at == tripdata_2025_cleaned$started_at, na.rm = TRUE)
## [1] 0
No started_at and ended_at times are equal.
Now let’s check if any ended_at date/times occur before the started_at date/times.
sum(tripdata_2025_cleaned$ended_at < tripdata_2025_cleaned$started_at, na.rm = TRUE)
## [1] 29
We do have an issue here with 29 ended_at date/times that come before the started_at.
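We’ll remove the rows where ended_at is before
started_at, since these represent invalid trip times.
The strict > in the filter would also drop zero-length rides, but the previous check found none.
With the amount of data we have, removing 29 rows should have a minimal
effect on our dataset.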
tripdata_2025_cleaned <- tripdata_2025_cleaned %>%
filter(ended_at > started_at)
We’ll add three new columns: ride_length holds the trip duration in minutes, while month and day_of_week are ordered factors showing the full label.
tripdata_2025_cleaned <- tripdata_2025_cleaned %>%
mutate(
ride_length = as.numeric(difftime(ended_at, started_at, units="mins")),
month = month(started_at, label = TRUE, abbr = FALSE),
day_of_week = wday(started_at, label = TRUE, abbr = FALSE)
)
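To illustrate what these three calls return, here is a small sketch on hand-typed timestamps. The times resemble the first row of the data but use whole seconds only, so the result will not exactly match the value stored in the dataset.

```r
library(lubridate)

# Whole-second timestamps for illustration (UTC is an arbitrary choice here)
demo_start <- ymd_hms("2025-01-21 17:23:54", tz = "UTC")
demo_end   <- ymd_hms("2025-01-21 17:37:52", tz = "UTC")

# difftime() returns a "difftime" object; as.numeric() keeps just the number of minutes
demo_ride_length <- as.numeric(difftime(demo_end, demo_start, units = "mins"))
demo_ride_length

# Without label = TRUE, wday() returns 1 for Sunday through 7 for Saturday by default
wday(demo_start)

# With label = TRUE the result is an ordered factor, so days compare by position in the week
demo_days <- wday(ymd(c("2025-01-19", "2025-01-25")), label = TRUE, abbr = FALSE)
demo_days[1] < demo_days[2]
```

The ordering matters later on: it is what lets tables sort Sunday through Saturday and lets `max(day_of_week)` pick out the last day of the week.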
Let’s take a look to make sure the new columns were added correctly.
glimpse(tripdata_2025_cleaned)
## Rows: 5,552,965
## Columns: 16
## $ ride_id <chr> "7569BC890583FCD7", "013609308856B7FC", "EACACD3CE0…
## $ rideable_type <chr> "classic_bike", "electric_bike", "classic_bike", "c…
## $ started_at <dttm> 2025-01-21 17:23:54, 2025-01-11 15:44:06, 2025-01-…
## $ ended_at <dttm> 2025-01-21 17:37:52, 2025-01-11 15:49:11, 2025-01-…
## $ start_station_name <chr> "Wacker Dr & Washington St", "Halsted St & Wrightwo…
## $ start_station_id <chr> "KA1503000072", "TA1309000061", "13235", "13235", "…
## $ end_station_name <chr> "McClurg Ct & Ohio St", "Racine Ave & Belmont Ave",…
## $ end_station_id <chr> "TA1306000029", "TA1308000019", "13278", "13071", "…
## $ start_lat <dbl> 41.88314, 41.92915, 41.94823, 41.94823, 41.94823, 4…
## $ start_lng <dbl> -87.63724, -87.64915, -87.66407, -87.66407, -87.664…
## $ end_lat <dbl> 41.89259, 41.93974, 41.94553, 41.94374, 41.94374, 4…
## $ end_lng <dbl> -87.61729, -87.65887, -87.64644, -87.66402, -87.664…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
## $ ride_length <dbl> 13.957950, 5.072400, 11.591667, 3.570550, 2.573817,…
## $ month <ord> January, January, January, January, January, Januar…
## $ day_of_week <ord> Tuesday, Saturday, Thursday, Thursday, Thursday, Th…
Looks good to me.
Finally, we’ll sort the whole dataset by the started_at
date/time.
This is optional, as later analyses don’t require sorted data, but it’s
helpful to know the trips are now sorted in chronological order.
tripdata_2025_cleaned <- tripdata_2025_cleaned %>%
arrange(started_at)
Before beginning analysis, we verified the integrity of the dataset, checked for missing or invalid values, and assessed its reliability, objectivity, and potential biases.
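One further integrity check that could sit alongside the ones above is confirming that ride_id values are unique. This is an assumption about what such a check might look like, shown on made-up rows; it is not a result from the Cyclistic data.

```r
library(dplyr)
library(tibble)

# Toy trips with one repeated ID (made-up values)
demo_trips <- tibble(
  ride_id = c("A1", "A2", "A2", "A3"),
  member_casual = c("member", "casual", "casual", "member")
)

# Number of rows whose ride_id already appeared in an earlier row
n_duplicates <- sum(duplicated(demo_trips$ride_id))
n_duplicates

# List the repeated IDs with their counts for inspection
repeated_ids <- demo_trips %>%
  count(ride_id) %>%
  filter(n > 1)
repeated_ids
```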
To understand overall usage, we first look at the number of rides taken by each rider type.
tripdata_2025_cleaned %>%
count(member_casual) %>%
mutate(n = format(n, big.mark = ",")) %>%
kable(
col.names = c("Rider Type", "Number of Rides")
) %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("bordered"))
| Rider Type | Number of Rides |
|---|---|
| casual | 1,999,488 |
| member | 3,553,477 |
As we noted before, there are roughly 78% more member rides than casual rides.
To visualize that in a quick bar chart:
# Let's store hex colors in a vector for casual and member to reuse in other charts
ride_type_colors <- c("member" = "#619CFF", "casual" = "#F8766D")
tripdata_2025_cleaned %>%
count(member_casual) %>%
ggplot(aes(x = member_casual, y = n, fill = member_casual)) +
geom_col(width = 0.4, show.legend = FALSE) +
geom_text(aes(label = scales::comma(n)), vjust = -0.6, size = 4) +
scale_fill_manual(values = ride_type_colors) +
scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.2))) +
labs(
x = "Rider Type",
y = "Number of Rides"
) +
theme_minimal() +
theme(
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 11, margin = margin(t = 8)),
axis.title.y = element_text(size = 11, margin = margin(r = 8))
)
We begin by examining when rides occurred, starting with monthly ride counts. Seasonality may differ between casual and member rides. While it is reasonable to expect that warmer months will have more rides overall, our focus here is on identifying any differences in seasonal patterns between rider types.
monthly_counts <- tripdata_2025_cleaned %>%
count(month, member_casual) %>%
pivot_wider(
names_from = member_casual,
values_from = n
) %>%
arrange(month)
monthly_counts %>%
mutate(
casual = format(casual, big.mark = ","),
member = format(member, big.mark = ",")
) %>%
kable(
col.names = c("Month", "Casual Rides", "Member Rides")
) %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "bordered"))
| Month | Casual Rides | Member Rides |
|---|---|---|
| January | 24,124 | 114,527 |
| February | 27,757 | 124,144 |
| March | 85,862 | 212,268 |
| April | 109,239 | 262,137 |
| May | 182,770 | 319,845 |
| June | 292,006 | 386,795 |
| July | 323,352 | 440,106 |
| August | 337,878 | 452,439 |
| September | 265,268 | 449,294 |
| October | 224,038 | 422,058 |
| November | 99,082 | 257,401 |
| December | 28,112 | 112,463 |
This table lets us compare the count for each month of the year, but since we know member rides occur far more frequently overall, it’s difficult to directly compare the two groups.
To address this, we’ll recreate the table using monthly percentages, calculated as the proportion of each rider type’s total rides in each month. This normalizes the data and provides a clearer basis for comparison.
monthly_counts %>%
mutate(
casual = paste0(format(casual, big.mark = ","), " (*", sprintf("%.1f", 100 * casual / sum(casual)), "%*)"),
member = paste0(format(member, big.mark = ","), " (*", sprintf("%.1f", 100 * member / sum(member)), "%*)")
) %>%
select(month, casual, member) %>%
kable(
col.names = c("Month", "Casual Rides", "Member Rides")
) %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "bordered"))
| Month | Casual Rides | Member Rides |
|---|---|---|
| January | 24,124 (1.2%) | 114,527 (3.2%) |
| February | 27,757 (1.4%) | 124,144 (3.5%) |
| March | 85,862 (4.3%) | 212,268 (6.0%) |
| April | 109,239 (5.5%) | 262,137 (7.4%) |
| May | 182,770 (9.1%) | 319,845 (9.0%) |
| June | 292,006 (14.6%) | 386,795 (10.9%) |
| July | 323,352 (16.2%) | 440,106 (12.4%) |
| August | 337,878 (16.9%) | 452,439 (12.7%) |
| September | 265,268 (13.3%) | 449,294 (12.6%) |
| October | 224,038 (11.2%) | 422,058 (11.9%) |
| November | 99,082 (5.0%) | 257,401 (7.2%) |
| December | 28,112 (1.4%) | 112,463 (3.2%) |
Using percentages gives us a better way to compare and we can quickly see that there are some obvious differences between the percentages of casual and member rides for some months.
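Since the same "count (percent)" label is built for several tables in this analysis, the pattern could be wrapped in a small helper. `format_count_pct()` is a hypothetical name introduced here, not something the analysis defines; the sketch also adds `trim = TRUE`, which drops the padding spaces that `format()` otherwise inserts.

```r
# Hypothetical helper: builds "count (*pct%*)" labels, where pct is each
# value's share of the column total
format_count_pct <- function(x) {
  paste0(
    format(x, big.mark = ",", trim = TRUE),
    " (*", sprintf("%.1f", 100 * x / sum(x)), "%*)"
  )
}

# Monthly casual ride counts copied from the table above (January to December)
casual_by_month <- c(
  24124, 27757, 85862, 109239, 182770, 292006,
  323352, 337878, 265268, 224038, 99082, 28112
)

casual_labels <- format_count_pct(casual_by_month)
casual_labels
```

With a helper like this, each table's `mutate()` step could shrink to something like `mutate(across(c(casual, member), format_count_pct))`.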
Let’s plot the monthly percentages on a grouped bar chart to give us an even easier way to spot any trends between rider types.
month_pct <- tripdata_2025_cleaned %>%
count(month, member_casual) %>%
group_by(member_casual) %>%
mutate(pct = n / sum(n)) %>%
ungroup()
ggplot(month_pct, aes(x = month, y = pct, fill = member_casual)) +
geom_col(position = position_dodge(width = 0.8), width = 0.7) +
scale_y_continuous(
labels = scales::percent_format(accuracy = 1),
expand = expansion(mult = c(0, 0.05))
) +
scale_fill_manual(values = ride_type_colors) +
labs(
x = "Month",
y = "Percentage of Rides",
fill = NULL
) +
theme_minimal() +
theme(
axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 10),
axis.title.x = element_text(margin = margin(t = 10)),
axis.title.y = element_text(margin = margin(r = 10)),
legend.position = "inside",
legend.position.inside = c(0.075, 0.80),
legend.justification = c(0, 1),
legend.direction = "vertical",
legend.text = element_text(size = 11),
legend.key.spacing.y = unit(8, "pt"),
legend.background = element_rect(
fill = scales::alpha("white", 0.8),
color = NA
),
panel.grid.minor = element_blank()
)
The chart shows clear seasonal differences in rider behavior. A larger
proportion of casual rides occur during the summer months of
June, July, and
August, while a larger proportion of member
rides occur during the winter months of December,
January, and February.
To better summarize these seasonal patterns, we can group months into broader season periods and compare the proportion of rides taken by each rider type.
tripdata_2025_cleaned <- tripdata_2025_cleaned %>%
mutate(
season = case_when(
month %in% c("December", "January", "February") ~ "Winter",
month %in% c("March", "April", "May") ~ "Spring",
month %in% c("June", "July", "August") ~ "Summer",
month %in% c("September", "October", "November") ~ "Fall"
),
season = factor(season, levels = c("Winter", "Spring", "Summer", "Fall"))
)
# Pivot wider and create kable
tripdata_2025_cleaned %>%
count(season, member_casual) %>%
pivot_wider(
names_from = member_casual,
values_from = n
) %>%
arrange(season) %>%
mutate(
casual = paste0(format(casual, big.mark = ","), " (*", sprintf("%.1f", 100 * casual / sum(casual)), "%*)"),
member = paste0(format(member, big.mark = ","), " (*", sprintf("%.1f", 100 * member / sum(member)), "%*)")
) %>%
select(season, casual, member) %>%
kable(
col.names = c("Season", "Casual Rides", "Member Rides")
) %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "bordered"))
| Season | Casual Rides | Member Rides |
|---|---|---|
| Winter | 79,993 (4.0%) | 351,134 (9.9%) |
| Spring | 377,871 (18.9%) | 794,250 (22.4%) |
| Summer | 953,236 (47.7%) | 1,279,340 (36.0%) |
| Fall | 588,388 (29.4%) | 1,128,753 (31.8%) |
This makes it very clear how much the time of year differs between the two groups. The summer and winter percentages show the largest differences. Winter accounts for more than double the share of member rides (9.9%) compared with casual rides (4.0%), but it is still a relatively small proportion next to summer, which accounts for nearly half of all casual rides and over a third of member rides.
Next, we check how many rides occur on each day of the week, comparing
member vs casual.
Hypothesis: Member rides will be more frequent on
weekdays, while casual rides will peak on weekends. This would likely
reflect members using the bikes for their daily commute, while
casual riders use them more for recreation. Since we know the
total counts are so different, we’ll include percentages.
day_of_week_counts <- tripdata_2025_cleaned %>%
count(day_of_week, member_casual) %>%
pivot_wider(
names_from = member_casual,
values_from = n
) %>%
arrange(day_of_week)
day_of_week_counts %>%
mutate(
casual = paste0(format(casual, big.mark = ","), " (*", sprintf("%.1f", 100 * casual / sum(casual)), "%*)"),
member = paste0(format(member, big.mark = ","), " (*", sprintf("%.1f", 100 * member / sum(member)), "%*)")
) %>%
select(day_of_week, casual, member) %>%
kable(
col.names = c("Day of the Week", "Casual Rides", "Member Rides")
) %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "bordered"))
| Day of the Week | Casual Rides | Member Rides |
|---|---|---|
| Sunday | 331,783 (16.6%) | 382,293 (10.8%) |
| Monday | 228,253 (11.4%) | 502,767 (14.1%) |
| Tuesday | 225,586 (11.3%) | 563,070 (15.8%) |
| Wednesday | 221,538 (11.1%) | 550,154 (15.5%) |
| Thursday | 258,045 (12.9%) | 576,005 (16.2%) |
| Friday | 320,077 (16.0%) | 528,989 (14.9%) |
| Saturday | 414,206 (20.7%) | 450,199 (12.7%) |
We can quickly see that a much higher percentage of casual rides are on the weekends, whereas member rides tend to be during the week.
We can plot these day of the week percentages on a line chart to compare rider types.
# Calculate percentages and store in a new tibble for later use
day_of_week_percent <- tripdata_2025_cleaned %>%
count(day_of_week, member_casual) %>%
group_by(member_casual) %>%
mutate(percentage = n / sum(n) * 100) %>%
ungroup()
last_points <- day_of_week_percent %>%
group_by(member_casual) %>%
filter(day_of_week == max(day_of_week))
# Create a line chart using the stored variable
ggplot(day_of_week_percent, aes(x = day_of_week, y = percentage, color = member_casual, group = member_casual)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3) +
scale_color_manual(values = ride_type_colors) +
scale_y_continuous(limits = c(10, 21), breaks = seq(10, 21, 2)) +
labs(title = "Percentage of Rides by Day of Week and Rider Type",
x = NULL, y = "Percentage of Rides", color = "Rider Type") +
theme_minimal() +
theme(
legend.position = "none",
axis.text.x = element_text(size = 11),
axis.title.y = element_text(margin = margin(r = 8))
) +
geom_text(
data = last_points,
aes(label = member_casual),
hjust = -0.2,
vjust = 0.22,
size = 5
) +
coord_cartesian(clip = "off")
This clearly shows how different the curves are.
Casual rides are more prevalent on weekends, dipping lower
mid-week.
Member rides are lowest on the weekends, with the most rides occurring
on Tuesday, Wednesday, and Thursday.
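To put a single number on the weekday/weekend split, the daily counts can be collapsed into two groups. The sketch below retypes the counts from the day-of-week table above; the `day_type` column is introduced only for this illustration.

```r
library(dplyr)
library(tibble)

# Counts copied from the day-of-week table
day_counts <- tibble(
  day_of_week = c("Sunday", "Monday", "Tuesday", "Wednesday",
                  "Thursday", "Friday", "Saturday"),
  casual = c(331783, 228253, 225586, 221538, 258045, 320077, 414206),
  member = c(382293, 502767, 563070, 550154, 576005, 528989, 450199)
)

# Collapse to weekend vs weekday and compute each rider type's share
weekend_share <- day_counts %>%
  mutate(day_type = if_else(day_of_week %in% c("Saturday", "Sunday"),
                            "Weekend", "Weekday")) %>%
  group_by(day_type) %>%
  summarise(casual = sum(casual), member = sum(member)) %>%
  mutate(
    casual_pct = round(100 * casual / sum(casual), 1),
    member_pct = round(100 * member / sum(member), 1)
  )

weekend_share
```

For reference, if rides were spread evenly across the week, the two weekend days would account for 2/7 of rides, or about 28.6%.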
We’ll now look for any differences in the time of day between
casual and member rides. We’ll use the
started_at date/time and break the hours of the day into eight
three-hour groups. Eight buckets strike a good balance: few enough to keep
the analysis uncluttered, yet enough to split the day
into distinct time periods.
Hypothesis: Member rides will tend to be most frequent
during rush hour times (7am-10am and 3pm-6pm), while casual
rides will be higher in the middle of the day. This again would reflect
members using the bikes for their commute and casual
riders using them for recreation.
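Before applying the bucketing to the full dataset, it can help to see how `cut()` assigns individual hours. The sketch below uses a handful of example hours; `right = FALSE` makes each interval include its lower bound and exclude its upper bound, so hour 3 falls in "3am-6am" rather than "Midnight-3am".

```r
bucket_labels <- c("Midnight-3am", "3am-6am", "6am-9am", "9am-Noon",
                   "Noon-3pm", "3pm-6pm", "6pm-9pm", "9pm-Midnight")

# A few example starting hours (0-23)
demo_hours <- c(0, 2, 3, 8, 9, 17, 23)

demo_buckets <- cut(
  demo_hours,
  breaks = seq(0, 24, by = 3),
  right = FALSE,  # intervals are [0,3), [3,6), ..., [21,24)
  labels = bucket_labels
)

data.frame(hour = demo_hours, bucket = demo_buckets)

# With the default right = TRUE the first interval is (0,3], so hour 0 gets no bucket
is.na(cut(0, breaks = seq(0, 24, by = 3)))
```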
tripdata_2025_cleaned <- tripdata_2025_cleaned %>%
mutate(
# Extract the ride starting hour (0-23)
start_hour = hour(started_at),
# Create the time intervals
start_time_bucket = cut(
start_hour,
breaks = seq(0, 24, by = 3),
right = FALSE,
labels = c("Midnight-3am", "3am-6am", "6am-9am", "9am-Noon", "Noon-3pm", "3pm-6pm", "6pm-9pm", "9pm-Midnight")
)
)
# Pivot wider and create kable
tripdata_2025_cleaned %>%
count(start_time_bucket, member_casual) %>%
pivot_wider(
names_from = member_casual,
values_from = n
) %>%
arrange(start_time_bucket) %>%
mutate(
casual = paste0(format(casual, big.mark = ","), " (*", sprintf("%.1f", 100 * casual / sum(casual)), "%*)"),
member = paste0(format(member, big.mark = ","), " (*", sprintf("%.1f", 100 * member / sum(member)), "%*)")
) %>%
select(start_time_bucket, casual, member) %>%
kable(
col.names = c("Time of Day", "Casual Rides", "Member Rides")
) %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "bordered"))
| Time of Day | Casual Rides | Member Rides |
|---|---|---|
| Midnight-3am | 80,722 (4.0%) | 64,470 (1.8%) |
| 3am-6am | 27,987 (1.4%) | 51,043 (1.4%) |
| 6am-9am | 146,638 (7.3%) | 557,280 (15.7%) |
| 9am-Noon | 266,031 (13.3%) | 479,728 (13.5%) |
| Noon-3pm | 399,600 (20.0%) | 564,248 (15.9%) |
| 3pm-6pm | 520,540 (26.0%) | 953,078 (26.8%) |
| 6pm-9pm | 373,521 (18.7%) | 642,121 (18.1%) |
| 9pm-Midnight | 184,449 (9.2%) | 241,509 (6.8%) |
We do see some differences here, but only in a few time spans, notably 6am-9am, where the member percentage is much higher, and noon-3pm, where the casual percentage is much higher. Most of the other time spans are very close.
Let’s plot this out on another grouped bar chart to visualize where the differences are.
time_of_day_pct <- tripdata_2025_cleaned %>%
count(start_time_bucket, member_casual) %>%
group_by(member_casual) %>%
mutate(pct = n / sum(n)) %>%
ungroup()
ggplot(time_of_day_pct, aes(x = start_time_bucket, y = pct, fill = member_casual)) +
geom_col(position = position_dodge(width = 0.8), width = 0.7) +
scale_y_continuous(
labels = scales::percent_format(accuracy = 1),
expand = expansion(mult = c(0, 0.05))
) +
scale_fill_manual(values = ride_type_colors) +
labs(
x = "Time of Day",
y = "Percentage of Rides",
fill = NULL
) +
theme_minimal() +
theme(
axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 10),
axis.title.x = element_text(margin = margin(t = 10)),
axis.title.y = element_text(margin = margin(r = 10)),
legend.position = "inside",
legend.position.inside = c(0.075, 0.80),
legend.justification = c(0, 1),
legend.direction = "vertical",
legend.text = element_text(size = 11),
legend.key.spacing.y = unit(8, "pt"),
legend.background = element_rect(
fill = scales::alpha("white", 0.8),
color = NA
),
panel.grid.minor = element_blank()
)
We can see from the bar chart that several time periods have very
similar ride distributions between rider types:
3am-6am, 9am-noon,
3pm-6pm, and 6pm-9pm.
Casual rides make up a higher percentage from
midnight-3am, noon-3pm, and
9pm-midnight, suggesting greater use outside of
traditional commuter hours.
In contrast, the percentage of member rides from
6am-9am is more than double that of casual
rides, strongly suggesting commuter-driven usage during morning rush
hours.
Overall, this partially supports our hypothesis. The
6am-9am window aligns with typical morning commute
hours, while late-night and midday use
would be more consistent with recreational rides. The
3pm-6pm time window shows similarly high usage for both
rider types, which may reflect an overlap of the evening commute with
recreational riding.
To start analyzing the ride length (duration) for each rider type, let’s look at the mean, median, min, and max.
tripdata_2025_cleaned %>%
group_by(member_casual) %>%
summarize(
mean_ride_length = mean(ride_length),
median_ride_length = median(ride_length),
min_ride_length = min(ride_length),
max_ride_length = max(ride_length)
) %>%
kable(
digits = 2,
format.args = list(big.mark = ",", nsmall = 2),
col.names = c("Rider Type", "Mean (mins)", "Median (mins)", "Min (mins)", "Max (mins)")
) %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("bordered"))
| Rider Type | Mean (mins) | Median (mins) | Min (mins) | Max (mins) |
|---|---|---|---|---|
| casual | 22.60 | 11.41 | 0.00 | 1,574.90 |
| member | 12.33 | 8.58 | 0.00 | 1,499.97 |
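Casual riders seem to take longer rides: their mean ride length is nearly double the member mean. The medians are closer (11.41 vs 8.58 minutes), which suggests that a relatively small number of very long casual rides pulls the casual mean upward.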
We can take a look at the ride length data on a density plot here. We limit the x-axis to 50 minutes to get a better view of the highest-density areas, because there is a long tail to the right that would continue to over 1,500 minutes.
tripdata_2025_cleaned %>%
ggplot(aes(x = ride_length, fill = member_casual)) +
geom_density(alpha = 0.55) +
scale_x_continuous(limits = c(0, 50), name = "Ride Length (mins)") +
ylab("Density") +
ggtitle("Ride Length Distribution by Rider Type") +
scale_fill_manual(values = ride_type_colors) +
theme_minimal() +
theme(
legend.position = "inside",
legend.position.inside = c(0.65, 0.65),
legend.key.spacing.y = unit(0.25, "lines"),
legend.background = element_rect(linewidth = 0.7),
legend.title = element_blank(),
axis.title.x = element_text(margin = margin(t = 8)),
axis.title.y = element_text(margin = margin(r = 8))
)
This shows us that members tend to take shorter rides, as the curve
peaks higher and slightly earlier.
Casual riders are more likely to take longer rides, as the casual curve
has a heavier right tail.
Both rider types exhibit similarly shaped distributions, with notable
overlap between 5 and 10 minutes.
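One caveat when reading a density plot with axis limits: in ggplot2, limits set on a scale drop out-of-range rows before the density is estimated, whereas `coord_cartesian()` estimates the density from every row and only crops the view. The sketch below shows the difference on made-up, right-skewed ride lengths (not the Cyclistic data); whether the two approaches differ noticeably on the real data has not been checked here.

```r
library(ggplot2)

# Made-up ride lengths with a long right tail
set.seed(42)
demo_rides <- data.frame(ride_length = rexp(5000, rate = 1 / 15))

# Scale limits: rides over 50 minutes are removed before the density is estimated
p_limits <- ggplot(demo_rides, aes(x = ride_length)) +
  geom_density() +
  scale_x_continuous(limits = c(0, 50))

# Coordinate zoom: the density uses every ride, then the view is cropped to 0-50
p_zoom <- ggplot(demo_rides, aes(x = ride_length)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 50))

# Compare the x-range over which each density was actually computed
# (suppressWarnings() hides the "removed rows" warning from the scale limits)
limits_range <- range(suppressWarnings(layer_data(p_limits))$x)
zoom_range   <- range(layer_data(p_zoom)$x)
limits_range
zoom_range
```

Both choices are defensible: the first describes the distribution of rides up to 50 minutes, while the second shows a cropped view of the distribution of all rides.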
Since we have two types of bikes in our data, we can compare which bike type was preferred on member and casual rides. Again, since the totals for each rider type are so different, we’ll add percentages to make it easier to compare.
rideable_counts <- tripdata_2025_cleaned %>%
mutate(rideable_type = recode(
rideable_type,
"classic_bike" = "Classic Bike",
"electric_bike" = "Electric Bike"
)) %>%
count(rideable_type, member_casual) %>%
pivot_wider(
names_from = member_casual,
values_from = n
)
rideable_counts %>%
mutate(
casual = paste0(format(casual, big.mark = ","), " (*", sprintf("%.1f", 100 * casual / sum(casual)), "%*)"),
member = paste0(format(member, big.mark = ","), " (*", sprintf("%.1f", 100 * member / sum(member)), "%*)")
) %>%
select(rideable_type, casual, member) %>%
kable(
col.names = c("Rideable Type", "Casual Rides", "Member Rides")
) %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "bordered"))
| Rideable Type | Casual Rides | Member Rides |
|---|---|---|
| Classic Bike | 672,670 (33.6%) | 1,275,359 (35.9%) |
| Electric Bike | 1,326,818 (66.4%) | 2,278,118 (64.1%) |
Both casual and member riders show a similar distribution of bike types, with roughly two-thirds of rides in each group taken on electric bikes. While casual riders have a slightly higher proportion of rides on electric bikes, the difference is small (about 2.3 percentage points) and unlikely to have a meaningful impact on our analysis.
While our earlier analyses explored temporal differences between casual and member riders, we can also examine *where* rides begin and end. By mapping each station and coloring it based on the proportion of casual vs member rides, we can see how usage patterns vary across the city.
station_usage <- tripdata_2025_cleaned %>%
# count() on an ungrouped tibble returns an ungrouped result, so rescale() below sees all stations at once
count(start_station_name, start_lat, start_lng, member_casual) %>%
pivot_wider(names_from = member_casual, values_from = n, values_fill = 0) %>%
mutate(
total = casual + member,
pct_casual = casual / total,
pct_member = member / total,
start_station_name = ifelse(is.na(start_station_name), "", start_station_name),
# log-based marker size (min 3px, max 20px)
marker_radius = scales::rescale(log1p(total), to = c(3, 20))
)
library(leaflet)
pal <- colorNumeric(
palette = colorRampPalette(c(ride_type_colors["member"], ride_type_colors["casual"]))(20),
domain = station_usage$pct_casual
)
leaflet(station_usage) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addCircleMarkers(
lng = ~start_lng,
lat = ~start_lat,
radius = ~marker_radius,
color = ~pal(pct_casual),
stroke = FALSE,
fillOpacity = 0.85,
label = ~lapply(
paste0(
ifelse(start_station_name == "" | is.na(start_station_name), "", paste0(start_station_name, "<br>")),
"Total rides: ", total, "<br>",
"Casual: ", sprintf("%.1f%%", 100 * pct_casual), "<br>",
"Member: ", sprintf("%.1f%%", 100 * pct_member)
),
htmltools::HTML
)
) %>%
addLegend(
pal = pal,
values = ~pct_casual,
title = "% Casual Rides",
labFormat = labelFormat(suffix = "%", transform = function(x) 100 * x)
)
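Marker sizes on the map come from `scales::rescale()`, which scales each value relative to the range of the vector it receives. The sketch below, with made-up ride totals, shows why that call should see every station at once rather than one row at a time, as would happen inside a `mutate()` on a tibble grouped into one-row groups.

```r
library(scales)

# Made-up station totals
demo_totals <- c(10, 100, 1000)

# Seen together, the values spread across the full 3-20 range
radius_together <- rescale(log1p(demo_totals), to = c(3, 20))
radius_together

# Seen alone, a single value has zero range, so rescale() falls back to
# the midpoint of `to` and every marker would get the same size
radius_alone <- rescale(log1p(10), to = c(3, 20))
radius_alone
```

If all markers on the map appear the same size, a leftover grouping on the station summary is a likely cause, and calling `ungroup()` before computing `marker_radius` would be the first thing to try.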